ABSTRACT

The purpose of this City Data study was to explore
sources of data outside the National Evaluation study of Follow
Through to determine the usefulness of such information in the
overall assessment. The two major objectives of the study were to
judge the representativeness of the Project Follow Through sample and
to check on the credibility of information collected in the National
Evaluation. The first part of the present study used data and
information collected by sponsors and local sources and addressed
issues of the credibility of conclusions about the achievement test
success of Follow Through children. The second part was designed to
use data and information collected from the National Title I survey,
from local sources, and from the Stanford Research Institute data
bank on Follow Through, and focused on the issues of the
representativeness of the Follow Through group in comparison with the
population of low income children enrolled in Title I. (CS)

The purpose of the City Data study was to explore sources
of data outside the National Evaluation study to determine the
usefulness of such information in the overall assessment of
Follow Through. Two major uses were anticipated: First, to
judge the representativeness of the FT sample. If we found
that the sample of tested FT children was broadly representative
of the population of lower-income children reached by
compensatory education programs, and if effects of FT overall, or
specific projects or models within FT, were found to help the
children, then we could generalize with more confidence about the
probable effects of FT models on the larger population of
low-income children. We would be on firmer ground in our judgment
that a successfully implemented FT model which worked in Brooklyn,
N.Y. could be exported to Chicago and have the same salutary
effect. If, on the other hand, we found that FT children who
tested better in reading were different in important ways (e.g.,
race or income or family size) from children elsewhere, we would
have less confidence in proposing that a successfully implemented
FT model could produce the same desirable results in other
locations.

The second anticipated major use of data sources outside
the National Evaluation was to check on the "credibility" of
information collected in the National Evaluation. The state of
instrumentation required to properly judge the effects of a
vast and ambitious undertaking like FT clearly lags behind the
broad and complex goals of the program. From the first generation
of evaluation studies in compensatory education (roughly
from 1965 to 1969), we came to be more fully aware of just how
uncertain our knowledge of both instrumentation and of choice
and execution of research design is; of just how many obstacles
and difficulties even the best-planned and thought out evaluation
scheme would encounter. One important lesson we learned was
that wherever and however possible, we should seek to use a
variety of data sources collected independently. Clearly no
single source should be entirely trusted. But a convergence of
information coming from a variety of sources collected by
different groups which all gave a similar picture should be more
trustworthy. Likewise, a serious divergence of findings would
be important information enabling us to enquire more skeptically
and in greater detail as to the reasons for the divergence. We
would be less quick to judge a complex program a failure as the
result of one evaluation using one type of measure of goal
fulfillment.

The course of this study of the usefulness of City Data
has consequently broadened. Where once we believed that major
reliance could be placed on data collected from a sample of
large cities, we now are inclined to look beyond large cities
to additional data sources; to all local education agencies (LEA's)
which have sites involved in the FT program; and to all sponsors
who also collect information about the effects of the Program on
children they enroll. A corollary of this broader scope of
exploration has been a more restricted sense of the usefulness
of any one source of data. Where once we felt that a small
number of large cities could provide us with relatively reliable
and valid information that would confirm or disconfirm findings
from the National Evaluation, we are now less sure that such
supplementary information, especially information coming from
standardized achievement tests, can by itself be of much use in
assessing program effects in the immediate term. This is true
whether one is interested in the long or short-term effects,
whether one wants data to be obtained and interpreted quickly or
only after careful analysis.

Organization of the Report

This report is divided into two parts or substudies. Each
substudy uses data from outside the National Evaluation to
explore an important issue in the overall assessment of FT.
The first part uses data and information collected by sponsors
and local sources and addresses issues of the credibility of
conclusions about the achievement test success of FT children.
The second part was designed to use data and information
collected from the National Title I survey, from local sources and
from the SRI data bank on FT, and focuses on the issues of the
representativeness of the FT group in comparison with the
population of low income children enrolled in Title I. A major
appendix (Appendix I) provides the results of the "Availability
of Local Data" survey conducted by Huron and collected from all
local education agencies with FT sites in 1972-73. This
appendix should be of use to FT-Washington or to outside
researchers who seek to explore achievement testing issues
that arise out of the FT program. Appendix II contains details
of our recommendations on guidelines for sponsor annual reports
submitted earlier to FT-OE.

PART I

SPONSOR AND LOCAL DATA AND THE NATIONAL EVALUATION



Apart from data collection activities specifically undertaken
as part of the National Evaluation of FT, the sponsors
represent a potentially valuable and prolific source of information
about the effects of FT on children, their families and their
schools. The major emphasis of a sponsor's time and budget is
properly directed toward program activities and not toward
evaluation. Nonetheless, all sponsors have at least thought
about process and product evaluation and some sponsors have done
a considerable amount of independent data collection and analysis
of the effects on children, families and schools. Sponsors
vary widely in the amount and type of evaluation information
they collect.

Sources of information from sponsors. Since 1971, The
Huron Institute has received copies of the annual reports sponsors
are required to submit to FT-Washington. We have tried to
obtain and examine all sponsor annual reports for the year
1971-1972 for this study. This has been our major source of sponsor
information about the measured effects on children, families and
schools. A list of the sponsors whose annual reports we examined
is shown in Table I. With the exception of the University of
Arizona, these include all the "major" FT sponsors. They cover
a large majority of all FT sites.



We looked at each annual report for evidence of measured
effects, primarily on children and secondarily on families and
schools. We placed special emphasis on measures of school
achievement, not because it is the only effect of interest and
importance, but because it permits us to make comparisons with
the National Evaluation data on achievement, and because improved
school achievement is certainly one of the most significant goals of
the FT program. We chose the year 1971-72 because it is the
latest year for which annual reports would be available to our study
and because we had greater faith in the national evaluation
test and in the meaningfulness of norms from the Metropolitan
Achievement Test in the Spring of 1972. Finally, we expected
that sponsor evaluation efforts would show the greatest refinement
and sophistication in the latest possible year we could
examine, especially for those sponsors who had been associated
with the FT program since 1968 or 1969.

In addition, we sent letters to all sponsors whose projects
were included in the 1973 Interim Report Analysis of Selected FT
Data, prepared by Abt Associates. We asked each sponsor to
comment on and provide supporting documentation "which either
would strengthen or weaken the conclusions they reached about
the achievement of FT pupils in your model." We suggested
examples of such evidence: achievement tests collected independently
either by the sponsor or the LEA for FT pupils; other
test results and measures of pupil progress and development;
examples of problems of test administration in the National
Evaluation; irregularities in the SRI data bank.

Having examined all the available sponsor annual reports,
we chose to concentrate attention on the data from three
sponsors: the University of Oregon, the University of Kansas,
and the University of Pittsburgh. These sponsors were selected
because they had the most complete and continuous achievement
testing programs of all major sponsors and because they
represented an approach to early childhood and primary education
which has been called "structured." There is developing evidence
from experimental preschool and primary programs and from Head
Start and Follow Through Planned Variations which suggests
(with some ambiguity) that more "structured" programs in
preschool and early primary years seem to produce enhanced
academic achievement. (For a complete review, see White, et al.,
1972, Part III.) We felt that a more thorough study of sponsors
who employ a more structured approach might provide valuable
evidence confirming or disconfirming this tentative pattern
of findings about "structure." Early in our investigation, the
three structured sponsors were asked to send to Huron as much
additional data as they had available on the results of their
1971-72 testing program.

We report the results of this substudy in two sections.
The first section discusses 1971-72 annual reports from sponsors
whose reports were made available to us. The second section
discusses in greater detail the achievement test findings from
the three most highly structured sponsors.



1. ANNUAL REPORT SUMMARIES



Table I

1971-72 Sponsor Annual Reports Examined

Bank Street College of Education Approach
Behavior Analysis Approach (University of Kansas)
California Process Model (California State Department
    of Education)
Cognitively Oriented Curriculum Model (Hi/Scope Educational
    Research Foundation)
Cultural Linguistic Follow Through Approach (University
    of California, Riverside)
EDC Open Education Follow Through Program
Florida Parent Education Model (University of Florida)
Individualized Early Learning Program (U. of Pittsburgh)
Language Development (Bilingual) Education Approach
    (Southwest Educational Development Laboratory)
Mathemagenic Activities Program (University of Georgia)
Responsive Educational Program (Far West Laboratory for
    Educational Research and Development)
University of Oregon Engelmann-Becker Model



Our overall impression of the annual reports that we
examined is that they are of uneven quality. Often they contain
large amounts of scattered, undigested, uninterpreted and
unanalyzed information about children, families and schools. Some
variety is wholesome and expected for, after all, they are
operating in uncharted territory. There is a dearth of solid
knowledge or even of concepts and frameworks which point out
information that is relevant. On the other hand, it strikes one
even more strongly that the format and content of many reports
are such that it is hard to imagine making any headway so long
as information continues to be reported in such a scattered
and disorganized fashion.*

Annual Reports for 1971-72 by Sponsor

We comment below on specific annual reports of six major
FT sponsors, leaving a discussion of the three most structured
sponsors for the second section.

Bank Street. There is no evidence of any achievement
testing of children in the Report. The only evidence of any
kind of measurement activity is the "Analysis of Communication
in Education" (ACE) instrument, developed by the sponsor for
quantifying aspects of the open classroom. Information
provided about reliability is insufficient for assessment. No
mention is made of validity. No references are provided
to other studies which used the same instrument. The study
which reported the use of the ACE instrument covered all 14
Bank Street FT sites, representing 78 classrooms and 468 hours
of observations. While details about research methodology are
inadequate, the little information provided casts doubt on
the meaningfulness of the comparisons presented. The non-FT
group seems to have been opportunistically selected from two
sites without any evidence of representativeness or comparability
of children or teachers at these non-FT sites. FT classrooms were
selected because of effective implementation of the Bank Street
approach. Finally, the sampling of time for measurement of
classroom interaction was not representative. Only one day
was observed, and teachers were notified in advance when they
would be observed. There is good reason to believe that
observers were aware of which classrooms were part of the
sponsor sample. Thus the results on this instrument, while of
possible use to the sponsor and the local site people, are
useless for FT research and evaluation purposes. Bank Street
should be encouraged to use this instrument under improved
experimental conditions if any valid generalizations are to
be made about the effects of the model at all Bank Street sites.
Other sponsors with an open education approach should be
encouraged to use the ACE instrument so that comparisons can be
made.

* Appendix II contains a letter we sent to Ms. Frieda Denemark
detailing our suggestions for improvements in the format and
content of sponsor annual reports which would make them more
useful for research and evaluation purposes.

Additional material supplied by Bank Street covered
results of measurements on children or parents made before
1971 or after 1972. No systematic sampling was undertaken and
not all sites were included, so it is not possible to assess
the representativeness of sites selected or parents or children
measured.

Educational Development Center. The EDC annual report
for 1971-72 contains no pupil or parent measurement data. It
reports "community data" by site. Appendices detail services
rendered at each site and provide anecdotal evidence of pupil,
teacher and parent interaction with EDC personnel.

In one appendix (XIV) and in a letter sent to Huron, the
research director of EDC FT explains why achievement tests are
felt to be hostile to open education, particularly in the
kindergarten. Alternatives are suggested: looking at goals
of the program apart from basic academic skills; using teachers'
records to record individual child development. However, the
annual report contains no documentation of teacher records used
in this way, nor of measurement of other aspects of child
development.

Far West Laboratory for Educational Research. In the
early years of sponsorship, a considerable amount of standardized
test baseline data and other measures of child cognitive
and affective status were collected. However, after 1970,
no pupil achievement data were collected. In 1971-72, pupils
were not tested by the sponsor at all. Thus there is no way
to compare sponsor and SRI-collected data.

Although the report promises local data collection, using
measures devised by the sponsor, in fact the only data presented
are IQ scores on the WPPSI over two years for one community
ending in 1971. In that study, no comparison groups were used.

Florida Parent Education Model. There is a great deal of
data in this report, but none directly measures changes in
pupil academic achievement. Most of the research work on
the Florida Model done by the sponsor concerns measures of
change in teachers (Purdue Teacher Opinionaire), in parent
educators (How I See Myself Inventory and Social Reaction
Inventory) and in mothers of target children (Home Environment
Review). Children are measured on two "affective" instruments
said to correlate with achievement, the "I Feel Me Feel" (IFMF)
and the Cincinnati Autonomy Test Battery (CATB).

The IFMF yields scores on five factors (general adequacy,
peer, teacher-school, academic and physical). It was
administered pre-post in 11 centers. Children in the FT program
made statistically significant gains on all five factors while
non-FT children gained significantly on three of five factors.
Results on the CATB were disappointing, inconclusive, and
difficult to interpret, partly because of the very small sample
size.

The format of this long report, without even a table of
contents, makes understanding what is being reported extremely
difficult.

Cognitively Oriented Curriculum Model. The Comprehensive
Test of Basic Skills, an achievement test, was given only to
third graders in FT and compared with a third grade comparison
group tested the year the FT program began so as to minimize
treatment contamination. The results for Cohort I of FT show
either no difference (2 comparisons) or a difference favoring
the control group (1 comparison). Other comparisons made for
children who entered in the 1968-69 year showed similar results.
The trends, incidentally, almost consistently favored the
controls.

On the Stanford-Binet IQ testing, conducted every year,
statistically significant gains in mean score from fall to
spring for the FT group were found in 5 of 10 centers.
Statistically significant higher mean scores of third grade FT
children compared with a group of non-FT third graders were
found in 3 of 5 centers.

One appendix of the annual report contains sections which
discuss results of local evaluation of achievement test scores
in three sites. Unfortunately, none of these sites is part
of the National Evaluation sample for the years that test scores
are reported locally. Thus, comparisons between local and
National Evaluation scores could not be made.

Mathemagenic Activities Program . There are no data what- 
soever regarding any child measure, parent measure or classroom 
measures. The only data consist of ratings of sites on project 
assessment and implementation criteria. 

Recommendations on Sponsor Annual Reports

Many sponsors included no measures of pupil development
at all. Some sponsors measured aspects of pupil cognitive
growth at a few sites, but only two measured academic
achievement. Only Bank Street attempted to systematically measure
classroom interaction for a sample of sites which was,
unfortunately, non-representative. For evaluation purposes the
results are useless since Bank Street sites were compared with
only two non-FT sites opportunistically selected. Only one
sponsor, High/Scope, mentioned the results of local testing.
Clearly, there is little meaningful and interpretable information
about pupil change in achievement or even cognitive development
in the 1971-72 sponsor annual reports we examined. There is
no reason to suspect that the annual reports not reviewed (i.e.,
those not included in Table I) are any different.

One can recognize that sponsors have varying goals and
place different emphasis on aspects of child, family and
school development. Many of these aspects are at present
imperfectly measurable. However, measures do exist for many
aspects of child and classroom development. One way to perfect
imperfect measures, to learn more about the validity of such
instruments, as well as to learn about the effects of program
activities, is to use a limited number of the best existing
measures systematically over a large number of subjects or
sites.

There remains the criticism that measurement of children,
as such, is hostile to the kind of experience that some models
are seeking to create in their classrooms. We quote here from
a letter sent to us by the EDC Evaluation Research Committee:

    The teacher following test directions talks a great
    deal to children but tells them nothing that would
    interest them. Our children are not accustomed to
    detailed directions about how and when to do their
    work. They are expected to proceed independently.
    Nor are they accustomed to sitting at separate desks
    for long periods of time with no communication with
    their fellow students.

    Many of the children in our program will be
    handicapped on this kind of pencil-paper test because
    we stress sharing, playing and working with a wide
    range of materials, art and music activities, sand,
    water and block activities, rather than workbook or
    mimeographed "test-like" materials.

The only proper response to such a critique is that a program 
model of this type should never have been included in a Planned 
Variation experiment. If the sponsor cannot imagine and then 
devise any type of measurement with validity and reliability 
which would not infringe on the child's accustomed mode of 
operation, then clearly the program model cannot be evaluated 
in an experimental situation. 

In the letter to Ms. Denemark we detail our recommendations
for improvements in annual reports (see Appendix II). We suggest
there that sponsors be asked to specify in advance, in their
proposals, what measures they intend to use. FT should provide
the necessary funds and technical assistance to sponsors both
in the formulation and in the execution of research design
proposals and plans. Sponsors should be required to collect and
report the results of testing undertaken at each site by the LEA
for all children in the FT grades, broken down by FT and non-FT,
with further refinement of the non-FT population if possible.


2. ACHIEVEMENT TEST FINDINGS FROM THE "STRUCTURED" SPONSORS 



We report here comparisons of scores on achievement tests
taken by the same children at about the same time in the more
highly structured programs sponsored by the University of Oregon
(Engelmann-Becker Model), the University of Kansas (Behavior
Analysis Approach), and the University of Pittsburgh
(Individualized Early Learning Program). All comparisons are made in
the grade equivalent (GE) metric. While serious problems
exist in the interpretation of grade equivalents, there is
simply no other way to make comparisons given the information
provided by the sponsors.

Using the GE metric to draw inferences about pupil growth
or status involves several difficulties. Depending on the
correlation of grade with achievement in the particular skill
area measured, a GE six months behind the norm may represent a
serious lack of achievement or just one or two questions missed.
If the average scores are near the national norm, GEs fluctuate
much more in relation to raw scores than at the far ends of the
distribution. GEs for tests taken between points of
standardization are linearly interpolated values, yet linearity of growth
throughout the school year is just an assumption, and one which has
little empirical support. Standard scores are thus far more
desirable, but we had no data from which standard scores could
be computed.
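
The interpolation problem described above can be made concrete with a
short sketch. The norm points below are invented for illustration and
do not come from the WRAT, the MAT or any published norm table; the
function names are likewise hypothetical. The sketch shows how a GE is
read off a straight line between two standardization points, and how
much GE a single raw-score point is worth when few questions separate
those points.

```python
def interpolate_ge(raw_score, lower, upper):
    """Linearly interpolate a grade equivalent between two norm points.

    Each norm point is a (grade_equivalent, median_raw_score) pair.
    """
    ge_lo, raw_lo = lower
    ge_hi, raw_hi = upper
    if raw_hi == raw_lo:
        raise ValueError("norm points need different raw scores")
    fraction = (raw_score - raw_lo) / (raw_hi - raw_lo)
    return ge_lo + fraction * (ge_hi - ge_lo)


def ge_per_raw_point(lower, upper):
    """GE change implied by one raw-score point between two norm points."""
    ge_lo, raw_lo = lower
    ge_hi, raw_hi = upper
    return (ge_hi - ge_lo) / (raw_hi - raw_lo)


# Hypothetical norm points, invented for illustration only.
FALL = (2.1, 30)            # GE 2.1 at a median raw score of 30
SPRING_WIDE = (2.8, 37)     # seven raw points separate fall from spring
SPRING_NARROW = (2.8, 32)   # only two raw points separate them

if __name__ == "__main__":
    # A raw score halfway between the two norm points
    print(round(interpolate_ge(33.5, FALL, SPRING_WIDE), 2))
    # GE value of one question on each hypothetical scale
    print(round(ge_per_raw_point(FALL, SPRING_WIDE), 2))
    print(round(ge_per_raw_point(FALL, SPRING_NARROW), 2))
```

On the narrow scale in this made-up case, two missed questions move a
child by about seven months of GE, which is the kind of fluctuation
the paragraph above warns about.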

The three sponsors tested either all or a random sample of
the children on the Wide Range Achievement Test (WRAT) in the
Spring of 1972. Additional comparisons from previous years of
sponsor testing and from local testing are included, although
emphasis has been placed on the spring 1972 testing point.
The comparison is between a sponsor-reported (or locally
reported) score on an achievement test and the SRI-administered
Metropolitan Achievement Test (MAT). Only children in grades
one through three are included since there are no validated
GE's at the kindergarten level for the MAT.

The charts which follow present data on test scores for
pupils in sites sponsored by the Universities of Oregon,
Kansas and Pittsburgh, the highly structured sponsors. Each
row represents site scores on achievement tests at one time of
test administration for a specific cohort at the designated
grade level. Sites are separated by single dark horizontal
lines. Underlined are the GE scores on the various tests. Thus,
the first row of the charts in Table II gives the Spring 1971
test results from the second grade FT pupils at the Dayton,
Ohio (University of Oregon) site. The tests compared are the
WRAT reading subtest and the Stanford Reading Achievement Test.
The SRAT was administered by the local district (note the column
the information appears in) to 158 FT second graders as well as
to a "comparison" group of 12C second grade pupils in schools
adjacent to the FT project school in Dayton. The GE comparison
shows FT pupils at grade 3.5 on the WRAT compared with grade
2.1 on the SRAT, Primary II. The FT students scored .1 GE
level below the comparison group on the SRAT, but both groups
scored well below grade level (2.8 or 2.9).

Discussion of the Comparisons. The comparisons will be
discussed mainly in terms of differences in grade equivalents
for the same FT children taking achievement tests about the
same time. However, it is possible, if one has faith in the
comparability of the SRI-designated NFT group, or of the
district-designated "comparison" group, or of the national
norm tables, to make formal and informal comparisons between
FT children and children not in FT.

FT children who take the WRAT at about the same time
as another standardized achievement test score higher on the
WRAT than on the other test. Overall, the advantage in GE for
taking the WRAT seems to be about 1 year GE in reading and about
1/2 year GE in arithmetic (Table III). In one dramatic instance,
the Grand Rapids (7.04) third grade, Spring 1972 reading test
scores, the WRAT GE is 5.8 compared with the MAT GE of 3.0,
obtained by local testing, a difference of 2.8 grade levels! In
only one case do FT children ever do worse on the WRAT than on
another standardized achievement test (in site 12.01, 3rd grade
total math).

Several other items of interest should be noted. The
number of children tested by the sponsor, by SRI or by the
local district for the same FT cohort is not the same. Some
difference is to be expected since the tests were not given at
exactly the same time. Large and unexplained differences are
noted in the far right column for site 7.11 (2nd grade, spring
1972), site 7.12 (third grade, spring 1972), site 8.03 (first
and second grade, spring 1972), site 12.01 (third grade, spring
1972) and site 12.04 (first grade, spring 1972). When
differences are so large, the possibility of nonrandom "attrition" is
certainly present and should be investigated. We have called
these discrepancies to the attention of SRI and the sponsors
and they are attempting to resolve them.*

TABLE III

DIFFERENCES IN GRADE EQUIVALENT
ON SPONSOR vs. LOCAL OR NATIONAL EVALUATION DATA

Reading Differences

     First Grade             Second Grade             Third Grade
Site & Year    Diff.*   Site & Year    Diff.*   Site & Year    Diff.*
 7.01 (1971)    +0.3     7.01 (1971)    +1.4     7.04 (1972)    +2.8
 7.03 (1972)    +0.8     7.03 (1972)    +1.3     7.11 (1972)    +1.4
 7.04 (1970)    +0.8     7.04 (1971)    +1.7     7.12 (1972)    +2.3
 7.07 (1972)    +0.8     7.08 (1972)    +1.1    12.01 (1972)    +0.4
 7.08 (1972)    +0.6     7.11 (1972)    +1.4
 8.01 (1972)    +0.3     8.03 (1972)    +0.3
 8.03 (1972)     0.0     8.04 (1972)    +0.4
 8.04 (1972)     0.0    12.01 (1972)    +0.3
 8.08 (1972)    +0.4    12.03 (1972)    +0.2
12.04 (1972)    +0.3    12.04 (1972)    +1.5
                        12.04 (1972)    +0.8
                        12.05 (1972)    +0.1

Arithmetic Differences

     First Grade             Second Grade             Third Grade
Site & Year    Diff.*   Site & Year    Diff.*   Site & Year    Diff.*
 7.03 (1972)    +0.4     7.03 (1972)    +0.6     7.04 (1972)    +0.5
 7.04 (1970)    +0.4     7.07 (1972)    +0.9     7.11 (1972)    +0.6
 7.07 (1972)    +0.7     7.08 (1972)    +0.3     7.12 (1972)    +0.7
 7.08 (1972)    +0.5     7.11 (1972)    +0.5    12.01 (1972)    -0.3
 8.01 (1972)    +0.7     8.01 (1972)    +1.2
 8.03 (1972)    +0.1     8.03 (1972)    +0.5
 8.04 (1972)    +0.4     8.04 (1972)    +0.4
 8.08 (1972)    +0.4    12.04 (1972)    +0.1
12.04 (1972)     0.0

* + means sponsor G.E. higher than comparison;
  - means sponsor G.E. lower than comparison.

TABLE III (cont'd)

Summary - Reading Differences (number of sponsor sites in parentheses)

Sponsor              First Grade       Second Grade      Third Grade
U. of Oregon         0.66 (5 sites)    1.38 (5 sites)    2.17 (3 sites)
U. of Kansas         0.10 (4)          0.35 (2)
U. of Pittsburgh     0.30 (1)          0.58 (5)          0.40 (1)
Average over
all sites            0.40              0.90              1.70

Overall average: 1.0

Summary - Mathematics Differences (number of sponsor sites in parentheses)

Sponsor              First Grade       Second Grade      Third Grade
U. of Oregon         0.50 (4 sites)    0.58 (4 sites)    0.60 (3 sites)
U. of Kansas         0.40 (4)          0.70 (3)
U. of Pittsburgh     0.00 (1)          0.10 (1)         -0.30 (1)
Average over
all sites            0.40              0.57              0.38

Overall average: .50

Looking in detail at the sponsor contributions to the
overall GE differences (Table III, Summary), it appears as
though the difference between sponsor WRAT and the National
Evaluation MAT scores is far greater in Oregon sites than in
Kansas and Pittsburgh sites. But even among the latter two
sponsors, differences are sometimes considerable.
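
As a check on how the summary figures relate to the site-level
entries, the sketch below recomputes the first-grade arithmetic
summary row of Table III from the site differences. The grouping of
sites by sponsor (7.xx for Oregon, 8.xx for Kansas, 12.xx for
Pittsburgh) is our reading of the table and the site counts in the
summary, not something the table states outright, and the function
names are hypothetical.

```python
from statistics import mean

# First-grade arithmetic differences (sponsor GE minus comparison GE)
# as read from Table III. Grouping sites by sponsor is an assumption
# based on site-number prefixes and the site counts in the summary.
FIRST_GRADE_ARITHMETIC = {
    "U. of Oregon": [0.4, 0.4, 0.7, 0.5],
    "U. of Kansas": [0.7, 0.1, 0.4, 0.4],
    "U. of Pittsburgh": [0.0],
}


def sponsor_means(differences):
    """Mean GE difference per sponsor, rounded to two decimals."""
    return {
        sponsor: round(mean(values), 2)
        for sponsor, values in differences.items()
    }


def mean_over_all_sites(differences):
    """Mean GE difference pooled over every site, rounded to two decimals."""
    pooled = [v for values in differences.values() for v in values]
    return round(mean(pooled), 2)


if __name__ == "__main__":
    for sponsor, value in sponsor_means(FIRST_GRADE_ARITHMETIC).items():
        print(f"{sponsor}: {value:.2f}")
    print(f"All sites: {mean_over_all_sites(FIRST_GRADE_ARITHMETIC):.2f}")
```

The pooled figure weights every site equally, so a sponsor with more
sites contributes more to the average over all sites than a sponsor
with a single site.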

The difference in GE between FT and a comparison group,
either the National Evaluation's NFT group or a locally created
comparison group, usually favors the FT with the startling
exception of Tupelo, Mississippi (7.11) which must be suspected of
initial non-comparability. But the differences between FT and
a comparison group are rarely as large as differences within a
single FT group as a result of merely taking a different
achievement test. The implications of this pattern may be profound.
in site 8.03 (Philadelphia) may be due to a 
decision by SRI not to test all classes at this site since the 
N is so large. 
What it might suggest is that measured increase in achievement, 
which we are trying to assess in the National Evaluation, 
may be more a function of the particular test administered than 
of the Program itself. If greater differences in GE are found 
between achievement tests administered to the same children at 
the same time than are found between different children, FT 
compared with non-FT, it becomes clear that any conclusion 
arrived at concerning the effects of FT using just one achievement 
test may be a conclusion not about the effects of the FT 
program but about the sensitivity of that particular test to 
the effects of FT or of a particular model in FT. 

For example, one can properly infer from the WRAT that 
FT children (in the highly structured models we have comparison 
information on) generally score well above grade level in the 
first grade (reading, 2.5; arithmetic, 2.2) and in the second 
grade (reading, 3.3; arithmetic, 3.0). In the third grade they 
are at grade level in arithmetic (3.8) but still one year above 
grade level in reading (4.8). For the same children on the 
MAT total reading and total math subtests, the children are 
above grade level in total reading (2.1) and at grade level in 
total math (1.8) in the first grade. They are behind grade 
level in the second grade (2.6 for total reading and 2.4 for 
total math) and well behind in the third grade (3.1 for total 
reading and 3.4 for total math). 
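
The size of the test-to-test discrepancy follows directly from the grade equivalents quoted in this paragraph. The short sketch below subtracts the MAT figure from the WRAT figure for the same children at each grade; the variable names are illustrative, and WRAT "arithmetic" is paired with MAT "total math" purely for the sake of the comparison.

```python
# Mean grade equivalents quoted in the text, keyed by grade.
wrat = {
    "reading": {1: 2.5, 2: 3.3, 3: 4.8},
    "math": {1: 2.2, 2: 3.0, 3: 3.8},
}
mat = {
    "reading": {1: 2.1, 2: 2.6, 3: 3.1},
    "math": {1: 1.8, 2: 2.4, 3: 3.4},
}

# WRAT minus MAT for the same children, rounded to one decimal place.
gaps = {
    subject: {
        grade: round(wrat[subject][grade] - mat[subject][grade], 1)
        for grade in wrat[subject]
    }
    for subject in wrat
}

for subject, by_grade in gaps.items():
    for grade, gap in by_grade.items():
        print(f"{subject}, grade {grade}: WRAT exceeds MAT by {gap} GE")
```

On these figures the reading gap widens from 0.4 GE in first grade to 1.7 GE in third grade, while the math gap stays between 0.4 and 0.6 GE.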

Considerations in Interpreting the Findings. If it is 
granted that the comparisons presented above are intriguing 
enough to pursue, given that the differences in a single 
group of FT children are typically much greater than differences 
between FT and non-FT, we ought to speculate and, if 
possible, investigate why this is so and whether this affects 
the credence we give to the MAT, the WRAT or to any standardized 
test of achievement. 

To anticipate the conclusion we have reached, the inquiry has 
created greater skepticism than we had initially about the 
use of all standardized achievement tests in making meaningful 
assertions about the effect of compensatory programs like FT 
on school achievement. Among other things, it further suggests 
to us that no single standardized test ought to be relied on in 
forming a complete judgment about the effects of the FT program 
on children's achievement in school. 

(1) Differences in test content: The simplest explanation 
for the WRAT-MAT reading discrepancy (which covers the majority 
of comparisons in Table II) is that the tests measure different 
skills, especially in reading, where the GE discrepancy between tests 
is greatest. 

A brief inspection of the MAT Primary I and II and 
Elementary batteries reveals important differences in the reading 
skills tested by the MAT and the WRAT. The WRAT requires 
that a student be able to name and recognize letters and then 
read aloud individual words of varying difficulty. In the MAT 
Primary I battery, for example, the total reading score consists 
of the sum of scores from two subtests, word knowledge and 
reading. Word knowledge requires an ability to read words 
silently, understand their meaning(s), and then interpret 
pictures, matching the appropriate word meaning with the pictures. 
It also requires many complex abilities in test-taking 
and following directions. Reading, in Primary I, requires 
comprehension of sentences, together with matching sentences 
with pictures. The child is also required to read stories, 
answer questions based on the literal meaning of the stories 
and draw inferences from the stories to answer other questions 
not based on the literal meaning of the stories. 

The MAT total reading score obviously demands a large 
number of complex skills. Many of these skills require abilities 
to interpret pictures and make other kinds of judgment and, above 
all, ability and motivation to follow directions. These skills 
are judged by the test publishers and by some authorities in 
reading instruction to be a crucial part of what it means to 
read in the early primary grades. Other authorities disagree, 
believing that the essence of early reading ability is the 
decoding of words and comprehension of word meaning. The debate 
between these two interpretations of reading, as exemplified by 
the MAT and WRAT tests, cannot be resolved by pointing out the 
predictive validity of the MAT, which is considerable. The 
MAT may test important school-related skills which predict 
well to future school achievement, but whether the 
MAT reading subtests are efficient and accurate tests of reading 
ability is not an issue in predictive validity. To put it 
another way, it may be that children who can read well will 
not score high on the MAT but will score high on the WRAT. If 
these children do not succeed in later school work, it may be 
because they lack other skills which the MAT reading test taps, 
but not because they can't read well. Such children will not 
need a compensatory reading program. Granted the importance 
of succeeding in terms of existing public school criteria, they 
will need entirely different kinds of instruction. Thus, to 
diagnose reading deficiency from a low score on the MAT would 
be wasteful of resources in compensatory education. 

(2) Differences in test administration: The WRAT as 
administered by the highly structured sponsors is intended to 
be given individually; the MAT is group administered. It is 
likely that children who might not be motivated to perform 
to capacity in a group situation might do so when tested in 
a one-to-one relationship, especially if they know and trust 
the tester. It is also likely that children would be more 
easily discouraged from attempting items if directions are 
complex and incorrectly or incompletely understood. The MAT, 
as mentioned before, makes far greater demands on the child in 
following complicated directions both in the interpretation of 
questions and in the marking of responses. 

We solicited comments from sponsors about the procedures 
used by SRI in test administration in the Spring of 1972. Did 
the children understand the directions? Did irregularities 
occur? Were test conditions, the testing environment, adequate 
and free from serious distraction? Several sponsors replied 
with general comments disapproving of testing children per se. 
This was deemed not relevant to the issue we were exploring. 
EDC raised questions about the qualifications and training of 
the testers who administered the SRI tests. 

At one of our sites parents complained that the 
testers were employed who were associated with 
another model. Many testers tell our children 
they are "going to play a game with them" (even 
though the test directions do not specifically 
instruct them to do so). These tests are not games. 
Our children expect to be treated honestly. One 
child, when the tester told him he "was going to 
play a game with him," listened for a little and 
then said, "If this is a game it's a dumb game, 
mister," and got up and left. 

No sponsor provided any evidence of irregularities, although 
occasional claims were made that children were tested in large, 
noisy auditoriums, etc. From the generally disappointing returns 
from sponsors, we have no way of knowing whether children 
did understand the directions, irregularities did not occur and 
testing conditions were reasonable, or alternatively, whether 
the sponsors don't bother to learn and collect such information. 

(3) Differences in norming procedure: The MAT is normed 
on a national sample of schools stratified geographically, by 
size of school location, by public or private sponsorship and 
by SES. Adjustments were made to assure that the sample was 
representative of the national population on mental ability 
test scores. 
For the WRAT, such information on the norming procedure 
is scanty. The manual states that "no attempt was made to 
obtain a representative national sampling. Nor is such a 
sampling considered essential for proper standardization." 
While this is true as far as it goes, it is incumbent upon 
the test publisher to provide information about the characteristics 
of the norming sample so that a user can determine the 
comparability of his sample to the norming sample. The WRAT 
publisher does provide information on the age and sex of the 
norming sample. It claims that IQ information was used to 
develop norms corresponding to the "achievement of mentally 
average groups with representative dispersions of scores..." 
Because of the incomplete information furnished in the WRAT 
manual about sampling, the test has been heavily criticized 
(Buros, 1972). 

(4) Other explanations for differences: There are two 
forms of the WRAT, of which only one is appropriate to the 
grade level of FT children. By the third grade, some FT 
children will have taken the same sponsor-administered WRAT 
test as many as four times. They will also have taken the 
SRI-administered WRAT, a modification of the publisher's test, 
two or three times. The possible distortion of scores owing to 
test familiarity is compounded by the danger of increased 
susceptibility to teaching the test items. On the other hand, 
three forms of the MAT exist at four different battery levels 
appropriate to the grade range of FT children. This clearly 
minimizes both of the distortions which may have occurred in 
the WRAT. 

Another possible explanation for the difference in GE between 
the WRAT and the MAT is that not all the same children are 
taking both tests. We have noted previously that sites exist 
where the difference between the number of children tested 
by the sponsor and the number tested by SRI is considerable. Even 
where the difference in N's is less dramatic, the mean grade 
equivalent might have been raised or lowered if there were 
systematic differences between tested and non-tested children. 

We know, for example, that most testing by the sponsors 
took place in the ninth month of school during 1971-72. In 
some sites this meant that children were tested just one or 
two weeks before the end of the school year. This might have 
biased the scores if, as we suspect, lower-achieving students' 
attendance drops off at the end of the school year. 

Another systematic decision that has been made at some 
sites is not to test children when it was felt that they would 
be unable to attempt a test above their ability level. We know 
that this happened in New York City on the MAT. There, the 
school system's policy is that children whose achievement is 
considerably below grade level should not take the MAT battery 
appropriate to their actual grade placement. For example, a 
second grade child reading at early first grade level would not 
take the MAT Primary II reading subtests along with the rest of 
his classmates. When class means were computed for the second 
grade in New York City, the reported mean score was for those 
children who took the Primary II reading subtest and not for 
all children in the classroom. The reading score, therefore, 
looks impressively high, but the N is very small. Having 
discovered this, we excluded the second grade New York City reading 
scores from our comparison. Unfortunately, we do not know the 
extent to which this practice was followed, either formally or 
informally, in other sites either on the MAT or on the WRAT. 
The validity of the class mean scores for either the MAT 
or the WRAT therefore remains open to question until this 
information is obtained for every site. 
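
The mechanism described above, a class mean computed only over the children judged able to take the grade-level battery, can be illustrated with a small sketch. The scores and the cutoff below are entirely hypothetical, invented for illustration; they are not data from any FT site.

```python
from statistics import mean

# Hypothetical reading grade equivalents for one second-grade class.
# These numbers are invented for illustration only.
class_scores = [1.2, 1.4, 1.5, 1.9, 2.3, 2.6, 2.8, 3.0]

# Assumed policy: children reading below this level are not given the
# grade-appropriate battery and therefore drop out of the class mean.
cutoff = 1.8

tested_scores = [score for score in class_scores if score >= cutoff]

mean_all = mean(class_scores)        # what the class as a whole would show
mean_tested = mean(tested_scores)    # what would actually be reported
inflation = mean_tested - mean_all

print(f"N rostered: {len(class_scores)}, N tested: {len(tested_scores)}")
print(f"Mean, all children:    {mean_all:.2f}")
print(f"Mean, tested children: {mean_tested:.2f}")
print(f"Apparent inflation:    {inflation:.2f} GE")
```

In this made-up class, excluding the three lowest readers raises the reported mean by several months of grade equivalent while the N shrinks, which is the pattern noted in the New York City second grade scores.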

Recommendations on Achievement Test Findings. An 
interesting, provocative and potentially important pattern of 
findings emerges if one compares the scores of the same 
children in highly structured models taking different reading 
and math achievement tests at the same time. Looking at one 
achievement test, the WRAT, children are well above grade level 
and national norms in reading past the third grade, and well 
ahead in math until the end of third grade when they are at 
the national norm. Looking at another achievement test, the 
MAT, children are doing far less well in the second grade. By 
the end of the third grade they are considerably below national 
norms. If we were to judge the success of FT on the basis of 
achievement tests in reading and math alone, we would be inclined 
to give FT an overwhelming vote of confidence using one test 
(the WRAT) and to have some serious reservations using another 
test (the MAT). Which test are we to believe? 

At this time, the answer has to be "neither." We have 
discussed and analyzed a series of possible reasons for the 
one-year overall GE discrepancy in reading and the 1/2 year 
overall GE discrepancy in math. Any one of them could be used 
to explain away the difference. In some cases, there are clear 
indications that the MAT and its procedure for administration are 
considerably underestimating real effects of the sponsors. There 
are equally clear indications in other cases that the WRAT might 
be overestimating effects. 

Some possible explanations could be tested if enough 
information were available from the sponsors, from local sites 
and from SRI. For other explanations, no satisfactory resolution 
seems attainable. It is difficult to imagine resolving 
the debate between those who argue that the "true" meaning of 
reading skill in the early primary grades is properly tested by 
the complex items of the MAT subtests and those who testify that 
reading skill in the early primary grades is more basically a 
matter of decoding as tested by the WRAT. For yet other explanations, 
knowledge from applied research that has not yet been done 
on the nature, interpretation and behavior of achievement tests 
is absolutely necessary before we can decide which test is more 
valid and which means of analysis are more appropriate in 
measuring program effects. 



Perhaps the most important lesson of this study is that 
National Evaluation data are just one of a number of sources of 
fallible information about the effect of FT on the academic 
achievement in basic skills of primary school children enrolled 
in the Program. While achievement test data in the National 
Evaluation ought to be collected with meticulous care and 
analyzed using the most sophisticated techniques available, 
they will not be able, at present, to provide the kind of 
unambiguous indication of the effect of FT on skill achievement 
that is desirable. 

One implication is that FT Research ought to be involved 
either directly or indirectly in the kind of applied research 
in testing and methodology that will lead to less ambiguous 
interpretations of data. 

A second implication is that FT Research ought to spur 
efforts throughout the FT Program at collecting better and more 
complete achievement test data both from sponsors and from LEA's. 
This involves not only providing funds for sponsor and LEA data 
collection but, more important, providing technical assistance 
and uniform guidelines and standards which would enable FT and 
outside research personnel to use information from sponsors, 
LEA's and the National Evaluation in order to arrive at better 
and more accurate estimates of the effects of FT on children's 
achievement. 

A third and related implication is that until data collection 
procedures and testing guidelines are created, the collection 
of massive amounts of test information from cities or 
other local education agencies at FT sites will be of little 
use. It will result in a deluge of new numbers and create 
additional problems of analysis and interpretation which will 
never be resolved. The first priority ought to be the gathering 
of better information from sponsors. Once this information is 
collected and analyzed and bugs ironed out, it will then become 
feasible to embark on the more ambitious undertaking of 
confirming these findings using additional sources of data from 
localities. 



PART II 



THE REPRESENTATIVENESS OF 
THE FT SAMPLE 



An exploration of the early history of Follow Through 
clearly indicates that by 1968, it had been decided that FT was 
to be an experimental program. It was designed to produce 
useful information for the time when the Program could be 
expanded nationwide as a service program for disadvantaged 
children, their families and the schools that served them. A 
primary purpose of FT since then has been to compile evidence 
to help guide decisions regarding the design and implementation 
of compensatory education. For this reason, issues concerning 
the generalizability of findings from the FT population to the 
larger target population of poor children are of crucial 
concern for policy making. 

Our investigation of the question is divided into three sections. 
Each section attempts to address the question: How 
representative is a sample of children (from whom we have data 
about FT effects) of a larger population? The variables on 
which representativeness is assessed, as well as the samples, 
vary from section to section. 

The first section looks at a sample from the entire population 
of FT children whose parents were interviewed. NORC 
interviewed parents of entering children (kindergarten or first 
grade) each year from the Spring of 1970. The section compares 
selected background information about the child and his family 
with similar background information about children and their 
families in Title I schools.* 

* This section is not included in the present report since we 
have not yet obtained printouts from Abt on the parent interviews. 
As soon as Abt provides us with this information, we will be able 
to make the comparison with Title I and will forward that section 
of the report. We expect that Abt will furnish us with these 
data in the next two weeks. 

The second section makes a different FT-Title I comparison. 
Here we look at the FT and Title I populations in two large 
cities, New York and Baltimore. The comparison variable is 
achievement scores expressed in G.E. in the upper grades. 
Achievement scores in the upper grades are used as a proxy 
variable for the host of child, family and school factors which 
influence school achievement. The logic of using this variable 
is: If we look at achievement scores in grades which the FT 
program has not yet reached, we have a relatively clean measure 
(barring massive year-to-year SES mobility) of how similar 
children were before the advent of the FT intervention. If FT 
schools and non-FT Title I schools show similar school achievement 
profiles in upper grades, it is highly likely that the 
similarity will extend to the lower grades where the FT program 
has begun. Note, however, that even if similarity is found, we 
can only generalize to these two cities and possibly to other 
large cities like New York and Baltimore. The other major drawback 
of this comparison is that if not all children in the lower 
grades are in the FT program, and if there is a selective process 
for picking children to receive the FT program, comparability 
is brought into question. On the evidence we have available, 
it is unusual for the FT program not to cover the entire grade 
of an elementary school, and hence this problem would seldom arise. 

The third section, a comparison of pupils within the 
National Evaluation sample, addresses a related but distinct 
issue of comparability. We tend to assume that data produced 
on tested FT children are representative of the population of 
all children in the FT program. If, however, tested FT 
children are considerably different in background characteristics 
from rostered but untested children, we would have to 
limit any conclusions from the National Evaluation to the 
group of tested or potentially testable FT children in the 
program. The study we did here should be considered exploratory. 
Only one background characteristic, race, was looked at. 
This is the only possible and meaningful comparison that could 
be made given the existing data tape. This comparison raises 
the disturbing possibility that rostered and tested FT children 
do differ on race in several sites. Because the sample is 
small, the conclusion arrived at is necessarily tentative. But 
it points decisively to the need for further study of the 
representativeness of tested FT pupils for the whole FT population. 
2. COMPARISON OF FOLLOW THROUGH SCHOOLS AND OTHER "DISADVANTAGED" 
SCHOOLS 

The question is: Are FT schools representative of the 
wider population of disadvantaged schools? Follow Through is 
a compensatory program aimed at "disadvantaged children in the 
primary grades of schools throughout the nation." It is 
reasonable to ask whether FT schools are reaching this special 
population, or rather, some subset of that population. We know 
enough about the history of Follow Through to expect that the 
practices followed in selecting FT schools varied widely from 
one place to another. 

Elmore (1972) has commented: 

The process used to nominate and select Follow Through 
sites was neither an arbitrary and irrational construct 
of some bureaucratic imagination nor a willful and perverse 
attempt to undermine good experimental design. 
It was founded on a very rational desire to minimize 
administrative difficulties. 

Despite this, we know that the OEO Poverty Index was used 
to select FT schools, first to identify disadvantaged pupils and 
then, through aggregation of these data, to identify disadvantaged 
schools.* One component of the Poverty Index is a measure 
of parental income, and an income level is also used to define 
schools eligible for Title I funds.** So there are good a priori 
reasons for using Title I schools as the comparison group 
representing the wider population of disadvantaged children. The 
investigation reported here asks how far FT schools are 
representative of the Title I population. 

* It should be pointed out that about one-third of the FT pupils 
do not in fact fall within the limits defined by the OEO Index 
(SRI Longitudinal Evaluation of Selected Features of the National 
Follow Through Program, March 1971). 

** AFDC eligibility is also used in conjunction with income level. 

Since the analysis depends on existing data sources rather 
than purpose-gathered data, it was natural that only a very limited 
range of comparison criteria could be found. In the end only 
one has been used: the school mean reading test score. This 
requires some justification, and not just because it is an imperfect 
proxy for many other important background variables which it would be 
interesting to take into account.* The problem is simply that 
test score differences between FT and Title I schools might be 
explained in terms of the effects of the FT program. This 
difficulty can be avoided in large measure if it is accepted 
that the test scores of pupils who could not have experienced 
Follow Through are an adequate means of characterizing the 
populations of the schools. Thus, the data presented here will 
refer to pupils in the FT and Title I schools who were too old 
to have been involved with Follow Through. The assumption is 
that the test scores of these pupils in higher grades reflect 
important characteristics of the populations of these schools. 
Further, it is assumed that the populations of these schools are 
sufficiently stable, at least over the short term of two or 
three years, to make use of test scores in this way. In fact 
we know that for these data, the grade mean scores at one 
grade level correlate very highly with the grade means at 
another level. Thus, for the New York schools, the correlation 
between school mean reading scores in second grade and fourth 
grade is 0.862 for a cohort of pupils (Acland, 1972). 

* It might be pointed out that school level variations in 
tested achievement are closely associated with school level 
variations in social background variables such as the traditional 
measures of socio-economic status. 

A further limitation concerns the unit of analysis: the 
school. We know that variations in tested achievement among 
pupils within the same school are nearly as large as variations 
among all pupils. That is to say, within-school variations are 
typically 60%-80% of the variation among the whole population 
of pupils. Now, we know that Title I funds are meant to be 
allocated to particular pupils within the schools. Rather 
than being used for general improvements to the schools, 
they are meant to be used for the most needy pupils. If this 
practice were followed in fact, the correct comparison would 
be between disadvantaged pupils in the FT program and those 
pupils within Title I schools who should be receiving Title I 
benefits. We are dealing here with school average scores which 
do not tell us about special sub-groups within the school. 

On the positive side, it may be pointed out that Title I 
funds may, in reality, be distributed in a great variety of 
ways, some of which diverge from the guidelines concerning 
allocation. 

Murphy (1973), for example, has pointed out: 

Currently it is not even clear to what extent Title I 
is expended on eligible disadvantaged children in 
poverty neighborhoods. Even when it reaches them, it 
is uncertain that the money buys services in addition 
to the level provided other school children in each 
district. 

It was necessary to limit the study to large cities which 
had more than one or two FT schools. Constraints imposed by 
using existing data sources further limited the investigation. 
In the event two large cities were studied, Baltimore and New 
York. Achievement test data were collected for all schools 
in these cities. The comparison of school mean scores for 
FT schools and Title I schools is presented in Tables IV through 
VI. In all these Tables, school means are presented by grade 
level for most of the elementary grades. 

The Baltimore data (Table IV) suggest that the FT schools 
have considerably lower reading scores than the whole population 
of elementary schools, and about the same scores as Title I schools. 
In the third grade, for example, FT schools are seen to score, 
on the average, around three months below the city-wide average. 
It may be added that the Baltimore city average is appreciably 
lower than the national norm for large cities (Baltimore 
Schools, 1971). These data, then, indicate that Follow Through 
really does reach the target population, at least in terms of 
the assumptions which have been defined here. 
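
The Baltimore comparison can be restated numerically from the 1969 school means in Table IV. The sketch below expresses each gap in school months and, as an additional illustration that is not part of the original analysis, in units of the all-schools standard deviation. It assumes the usual grade-equivalent convention that 0.1 GE corresponds to one school month; the names used are illustrative.

```python
# School-level means and standard deviations from Table IV (1969),
# keyed by the grade labels printed in the table.
ft_mean = {3: 2.473, 4: 3.215, 5: 4.005, 6: 5.167}
title1_mean = {3: 2.504, 4: 3.277, 5: 4.135, 6: 5.187}
all_mean = {3: 2.791, 4: 3.561, 5: 4.510, 6: 5.601}
all_sd = {3: 0.510, 4: 0.568, 5: 0.633, 6: 0.686}

MONTHS_PER_GRADE_YEAR = 10  # assumed convention: 0.1 GE = one school month


def gap_in_months(group_mean, reference_mean):
    """Difference between two GE means, expressed in school months."""
    return (group_mean - reference_mean) * MONTHS_PER_GRADE_YEAR


for grade in sorted(ft_mean):
    vs_city = gap_in_months(ft_mean[grade], all_mean[grade])
    vs_title1 = gap_in_months(ft_mean[grade], title1_mean[grade])
    # Gap relative to the spread of school means across the whole city.
    in_sd_units = (ft_mean[grade] - all_mean[grade]) / all_sd[grade]
    print(
        f"Grade {grade}: FT vs all schools {vs_city:+.1f} months "
        f"({in_sd_units:+.2f} SD); FT vs Title I {vs_title1:+.1f} months"
    )
```

On these figures the FT schools sit three to five months below the city-wide mean at every grade, but within about a month of the Title I schools, which is the pattern the text describes.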

A less consistent finding emerges from the New York City 
data (Tables V and VI). Two boroughs have been chosen, which 
had the largest number of Follow Through schools, Manhattan and 
TABLE IV 

BALTIMORE SCHOOLS: GRADE EQUIVALENT SCORES 
FOR SCHOOLS ON IOWA TEST OF BASIC SKILLS 
(Average of Vocabulary, Reading, Language 
Skills, Work-Study Skills and Arithmetic 
Skills Subtests) BY GRADE AND BY YEAR AND 
BY "WHETHER FT SCHOOL OR NOT" 

1969 
Grade:                     3       4       5       6       N 

FT Schools        Mean   2.473   3.215   4.005   5.167    13 
                  S.D.   0.195   0.248   0.294   0.284 

Title I Schools   Mean   2.504   3.277   4.135   5.187    74 
                  S.D.   0.173   0.252   0.271   0.338 

All Schools       Mean   2.791   3.561   4.510   5.601   155 
                  S.D.   0.510   0.568   0.633   0.686 

1970 
Grade:                     2       3       4       5       N 

FT Schools        Mean   2.669   3.362   4.208   5.409    13 
                  S.D.   0.330   0.340   0.309   0.378 

Title I Schools   Mean   2.654   3.311   4.268   5.321    74 
                  S.D.   0.311   0.274   0.345   0.476 

All Schools       Mean   2.926   3.680   4.664   5.780   155 
                  S.D.   0.522   0.598   0.639   0.733 


Brooklyn. In the first, nearly all the schools were defined as 
eligible for Title I funds, so in this case a comparison has 
been made between FT schools and all the schools in the borough 
(Table V). 

It is evident that FT schools in Manhattan tend to score 
below the average for the borough. There are variations from 
grade to grade; for example, in grade 3 (1970) FT schools 
score roughly three months ahead of the other schools. But in 
general, the Follow Through schools are about a month or so 
behind. From this it seems safe to conclude that FT schools 
are not atypical of Manhattan schools; at least they are not 
clearly superior to the average. 
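
The Manhattan figures quoted above can be checked against the 1970 rows of Table V. The sketch below computes the grade-by-grade gap between FT schools and all schools, again assuming that 0.1 GE corresponds to one school month; the grade labels 2 through 6 and the variable names are a reading of the table, not something the report states in this form.

```python
from statistics import mean

# School-level mean grade equivalents from Table V, 1970.
grades = [2, 3, 4, 5, 6]
ft_1970 = [2.715, 3.791, 4.125, 4.916, 5.725]
all_1970 = [2.806, 3.481, 4.372, 5.155, 5.934]

MONTHS_PER_GRADE_YEAR = 10  # assumed convention: 0.1 GE = one school month

# Positive values mean FT schools score ahead of the borough average.
gaps_months = [
    (ft - borough) * MONTHS_PER_GRADE_YEAR
    for ft, borough in zip(ft_1970, all_1970)
]
average_gap = mean(gaps_months)

for grade, gap in zip(grades, gaps_months):
    print(f"Grade {grade}: {gap:+.1f} months")
print(f"Average over grades: {average_gap:+.1f} months")
```

The grade 3 gap comes out at about three months in favor of FT schools, the other four grades are negative, and the unweighted average across grades is close to one month behind, in line with the description in the text.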

The results for Brooklyn schools (Table VI) are suspect 
because they are based on a very small number of FT schools. 
The differences in school averages for FT and Title I schools 
lie in no consistent direction here, but, as in the case of 
Manhattan schools, it appears that FT schools are roughly 
comparable to Title I schools. Certainly, FT schools are 
neither clearly superior nor inferior to other "disadvantaged" 
schools. 

Perhaps the greatest weakness of this analysis is that we 
lack a consensus about the appropriate comparison group. It 
would, after all, be surprising if there was agreement about 
the target population of compensatory programs. Allowing that, 
the case can be made that the Title I population approximates 
this comparison group, and if this assumption is granted, 



TABLE V 

ELEMENTARY SCHOOLS IN MANHATTAN. GRADE EQUIVALENT SCORES ON 
THE METROPOLITAN ACHIEVEMENT TEST (Average Score from Word 
Knowledge and Reading Subtests) BY GRADE, AND YEAR, AND BY 
"WHETHER FOLLOW THROUGH SCHOOL OR NOT" 

(Note: All but 10% of the schools in Manhattan are eligible 
for Title I funds, so the comparison presented here is 
between FT and all the schools in Manhattan.) 

1970 
Grade:                 2       3       4       5       6       N 

FT Schools    Mean   2.715   3.791   4.125   4.916   5.725     8 
              S.D.   0.345   0.391   0.335   0.468   0.179 

All Schools   Mean   2.806   3.481   4.372   5.155   5.934    87 
              S.D.   0.558   0.700   0.807   1.058   1.101 

1971 
Grade:                 2*      3       4       5       6       N 

FT Schools    Mean   2.757   3.238   4.066   4.812   5.575     8 
              S.D.   0.382   0.573   0.543   0.868   0.432 

All Schools   Mean   2.675   3.355   4.012  [4].837  5.865    87 
              S.D.   0.376   0.747   0.956   1.105   1.041 

* Pupils in second grade in 1971 could have had FT experience. 
TABLE VI 

BROOKLYN SCHOOLS. GRADE EQUIVALENT SCORES 
ON THE METROPOLITAN ACHIEVEMENT TEST (Average 
Score Based on Reading and Word Knowledge 
Subtests) BY GRADE AND YEAR AND WHETHER FOLLOW 
THROUGH SCHOOL OR TITLE I SCHOOL. 

1970 

Grade:                      2       3       4       5       6     N based on 

FT Schools        Mean    2.657   3.590   4.127   5.283   5.800 
                  S.D.    0.268   0.302   0.181   0.359   0.002 

Title I Schools   Mean    2.555   3.229   3.946   4.716   5.431      143 
                  S.D.    0.314   0.382   0.482   0.642   0.750 

1971 

Grade:                      2       3       4       5       6     N based on 

FT Schools        Mean    2.462*  3.107   3.872   4.700   5.500 
                  S.D.    0.127   0.168   0.411   0.660   0.0** 

Title I Schools   Mean    2.555   3.073   3.750   4.488   5.545      143 
                  S.D.    0.281   0.423   0.582   0.713   0.767 

* Pupils in second grade could have had FT experience. 

** One school in this cell. 

these analyses demonstrate that the FT sample is not seriously 
unrepresentative of the disadvantaged group. Certainly, the 
investigation has been restricted to two local areas, and it 
may be that evidence of unrepresentativeness could be found 
with a wider-reaching analysis. However, the indications of 
this analysis are that the FT sample reaches its target 
population. 

3. COMPARISON OF PUPILS WITHIN THE NATIONAL EVALUATION SAMPLE 

A further way of looking at the question of the representativeness 
of the Follow Through pupils makes use of the existing 
data gathered for the national evaluation. Again, our knowledge 
of procedures used to identify FT and NFT schools is sufficient 
to warn against expecting too much. It has already been mentioned, 
for example, that an unexpectedly small proportion of 
pupils in FT schools fall below the OEO Poverty line. Similarly, 
the selection of NFT schools leaves room for doubting their 
utility as comparison groups. Analysis of the national evaluation 
data bears this out. Stanford Research Institute found 
that little more than 40% of the FT/NFT matches were "good" in 
terms of their definition.* 



* Seven baseline variables were used to estimate the quality of 
the match between FT and NFT samples. "For each project, the 
number of these variables showing a FT/NFT difference of 10 
percentage points or more was tabulated. Three or less 
discrepancies of 10 percent or more results in the classification of 
an FT/NFT comparison as a 'good' match." [p. 278, SRI Interim 
evaluation of the national Follow Through Program 1969-1971, 
February 1973] 
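
The classification rule quoted in the footnote can be written out as a short sketch. The seven baseline variables are not named in this section, so the keys `var_1` to `var_7` and all of the percentages below are invented purely for illustration; only the 10-point threshold and the "three or less" limit come from the quoted definition.

```python
THRESHOLD_POINTS = 10.0  # FT/NFT difference, in percentage points, counted as a discrepancy
MAX_DISCREPANCIES = 3    # "three or less" discrepancies still gives a "good" match


def count_discrepancies(ft, nft, threshold=THRESHOLD_POINTS):
    """Count baseline variables whose FT/NFT difference reaches the threshold."""
    return sum(1 for name in ft if abs(ft[name] - nft[name]) >= threshold)


def is_good_match(ft, nft):
    """Apply the quoted rule: at most three discrepancies means a "good" match."""
    return count_discrepancies(ft, nft) <= MAX_DISCREPANCIES


# Hypothetical project: percentages on seven placeholder baseline variables.
ft_project = {"var_1": 62.0, "var_2": 40.0, "var_3": 55.0, "var_4": 30.0,
              "var_5": 71.0, "var_6": 18.0, "var_7": 45.0}
nft_project = {"var_1": 50.0, "var_2": 38.0, "var_3": 66.0, "var_4": 29.0,
               "var_5": 60.0, "var_6": 20.0, "var_7": 44.0}

print(count_discrepancies(ft_project, nft_project), is_good_match(ft_project, nft_project))
```

Note how permissive the rule is: a project can differ by 10 points or more on three of the seven variables and still be classed as a good match.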



The strategy adopted here was to compare the characteristics 
of two groups of pupils, both falling within the FT sample, one 
being tested during the year, the other being excluded from the 
testing program. These comparisons were replicated for each 
entering cohort, that is, the entering K and entering first 
grades in the three years 1969-70, 1970-71 and 1971-72. There 
are two issues which these comparisons address. First, it will 
be asked if the tested pupils, who form the basis for the key 
analyses of the national evaluation, are representative of the 
larger sample covered by the Follow Through program. Second, 
it will be asked if there are substantial variations in the 
background characteristics of the tested pupils from one year 
to the next. 

We had to rely on pupil background information contained 
in the SRI Index Tape, information which was limited in scope. 
In the event we decided to use an index of the racial 
composition of the tested and untested groups: the proportion of 
blacks. For each site, the proportion of black pupils was 
computed for FT-tested and FT-untested groups. The results are 
presented in Table VII. Admittedly, this variable does not 
capture many aspects of background differences. But, on the 
positive side, racial background has been regarded, traditionally, 
as one of the key baseline variables in most evaluation analyses. 
Those who have sought to evaluate the effectiveness of programs 
such as Follow Through have been sensitive, above all, to the 
possibilities of variations in racial background as a causative 
or confounding factor. The comparisons between FT-tested and 
FT-nontested are summarized in Table VII. The FT-tested group 
are those who received some kind of achievement test during the 
year in question, either during the fall or during the spring. 
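
The index described above is a simple proportion computed within each site and testing group. A minimal sketch follows; the layout of the SRI Index Tape is not described here, so the record format `(site, tested, is_black)` and the sample records are hypothetical.

```python
from collections import defaultdict


def percent_black_by_site(records):
    """Return {(site, tested): percent black} from (site, tested, is_black) records."""
    counts = defaultdict(lambda: [0, 0])  # (site, tested) -> [black pupils, all pupils]
    for site, tested, is_black in records:
        cell = counts[(site, tested)]
        cell[0] += int(is_black)
        cell[1] += 1
    return {key: round(100.0 * black / total, 1)
            for key, (black, total) in counts.items()}


# Hypothetical pupil records for one invented site "A":
# three tested pupils (two black) and four non-tested pupils (one black).
sample_records = [
    ("A", True, True), ("A", True, True), ("A", True, False),
    ("A", False, True), ("A", False, False), ("A", False, False), ("A", False, False),
]

shares = percent_black_by_site(sample_records)
print(shares)
```

A site with no non-tested pupils simply has no `(site, False)` entry, which corresponds to an empty cell in the table.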

Two observations can be made from these findings. The 
first is that there is a high degree of stability in the racial 
composition of successive FT-tested cohorts. Take, for example, 
site 03.09. The proportion of blacks in the FT-tested sample 
changes from 56.7% in Cohort I to 56.3% in Cohort II to 49.0% 
in Cohort III. Similarly high consistency can be found for 
other sites, with one exception (01.04). 
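
One way to make "stability" concrete is the spread between the highest and lowest percentage across a site's tested cohorts. The sketch below uses tested-group percentages as read from Table VII for three sites; the 15-point cut-off is an assumption chosen only for illustration, not a criterion used in the study.

```python
# Percent black among FT-tested pupils in successive cohorts, as read from Table VII.
TESTED_PERCENT_BLACK = {
    "0104": [25.0, 90.4],
    "0308": [25.0, 20.3, 20.5],
    "0701": [89.0, 91.9, 92.0],
}

STABILITY_LIMIT = 15.0  # assumed cut-off, in percentage points


def cohort_range(values):
    """Spread between the highest and lowest cohort percentage."""
    return round(max(values) - min(values), 1)


ranges = {site: cohort_range(values) for site, values in TESTED_PERCENT_BLACK.items()}
unstable = [site for site, spread in ranges.items() if spread > STABILITY_LIMIT]
print(ranges, unstable)
```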

The second observation is limited to a rather small 
number of sites in which we have both FT-tested and FT-nontested 
pupils. For this small number of cases we find that 
the two groups are generally similar in terms of racial 
composition, but that there are dramatic exceptions to this rule. 
For example, site 01.14 has 66.1% blacks in the FT-tested sample 
in 1969-70 compared to 29.1% black in the FT-nontested sample. 
Similar differences can be found in other sites (e.g., 03.07, 
1969-70). On the other hand, there are also sites in which the 
tested and untested pupils have very similar racial composition. 
For example, in site 05.10 the FT-tested pupils were 85% black 
compared to the FT-untested group which was 83% black. The 
same holds true for the next year. 

Not surprisingly, perhaps, we cannot reach firm conclusions 
from these analyses, but one cautious implication might be 
TABLE VII 

COMPARISON OF FT-TESTED AND FT-NON-TESTED PUPILS, 
BY YEAR, BY SITE. PERCENTAGE BLACK FOR ENTERING 
GRADE (EITHER K OR 1ST) 

             1969-70               1970-71               1971-72 
SITE    Tested  Non-Tested    Tested  Non-Tested    Tested  Non-Tested 

0104     25.0      --          90.4                            79.5 
0114     66.1     29.1         68.1     56.9                   42.2 
0201     53.4      --          48.1                            51.9 
0204      2.7      8.5          3.0                  3.6 
0302     98.5     87.7         98.9                            98.8 
0307     69.9     29.1         58.2     49.2        60.4 
0308     25.0                  20.3                 20.5 
0309     56.7                  56.3                 49.6 
0510     85.1     83.2         91.3     99.1        95.2 
0506     98.0                  97.6     92.9        98.0 
0604      5.8                   4.4                 10.7 
0701     89.0                  91.9                 92.0 
0711     69.2                  82.0     63.3        74.8 
0712                            0.5                  0.6 
0801     46.3                  57.7                 38.3 
0804     24.1                  36.4                 33.3 
0901     75.7    100.0         86.1    100.0                   96.2 
0902     66.7                  73.0                 76.0 
1002     12.3                  23.3                 12.7       25.0 
1102     26.8                  28.3                            27.5 
1301     81.4                  97.2     80.1        90.2 


suggested. The results raise the possibility that the sample 
of pupils selected for testing may be unrepresentative of the 
whole FT sample. It is likely that the degree of 
representativeness varies from site to site. Naturally, it is not possible 
to say if this happens. Just what determines this cannot be 
discovered with these data, but the process of sample selection 
may well be biased inadvertently. Whatever the cause, the 
consequence is that one should exercise extreme caution in 
making any assumptions about the referent population of the 
Follow Through sample. 